Every two years, the Centers for Disease Control and Prevention (CDC) conducts the Youth Risk Behavior Surveillance System (YRBSS) survey, which collects data from high schoolers (9th through 12th grade) to analyze health patterns. You will work with a selected group of variables from a random sample of observations from one of the years the YRBSS was conducted.
## Rows: 13,583
## Columns: 13
## $ age <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1…
## $ gender <chr> "female", "female", "female", "female", "fema…
## $ grade <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", …
## $ hispanic <chr> "not", "not", "hispanic", "not", "not", "not"…
## $ race <chr> "Black or African American", "Black or Africa…
## $ height <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1…
## $ weight <dbl> NA, NA, 84.4, 55.8, 46.7, 67.1, 131.5, 71.2, …
## $ helmet_12m <chr> "never", "never", "never", "never", "did not …
## $ text_while_driving_30d <chr> "0", NA, "30", "0", "did not drive", "did not…
## $ physically_active_7d <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, …
## $ hours_tv_per_school_day <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",…
## $ strength_training_7d <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, …
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"…
| Name | yrbss |
| Number of rows | 13583 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 8 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| gender | 12 | 1.00 | 4 | 6 | 0 | 2 | 0 |
| grade | 79 | 0.99 | 1 | 5 | 0 | 5 | 0 |
| hispanic | 231 | 0.98 | 3 | 8 | 0 | 2 | 0 |
| race | 2805 | 0.79 | 5 | 41 | 0 | 5 | 0 |
| helmet_12m | 311 | 0.98 | 5 | 12 | 0 | 6 | 0 |
| text_while_driving_30d | 918 | 0.93 | 1 | 13 | 0 | 8 | 0 |
| hours_tv_per_school_day | 338 | 0.98 | 1 | 12 | 0 | 7 | 0 |
| school_night_hours_sleep | 1248 | 0.91 | 1 | 3 | 0 | 7 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 77 | 0.99 | 16.16 | 1.26 | 12.00 | 15.0 | 16.00 | 17.00 | 18.00 | ▁▂▅▅▇ |
| height | 1004 | 0.93 | 1.69 | 0.10 | 1.27 | 1.6 | 1.68 | 1.78 | 2.11 | ▁▅▇▃▁ |
| weight | 1004 | 0.93 | 67.91 | 16.90 | 29.94 | 56.2 | 64.41 | 76.20 | 180.99 | ▆▇▂▁▁ |
| physically_active_7d | 273 | 0.98 | 3.90 | 2.56 | 0.00 | 2.0 | 4.00 | 7.00 | 7.00 | ▆▂▅▃▇ |
| strength_training_7d | 1176 | 0.91 | 2.95 | 2.58 | 0.00 | 0.0 | 3.00 | 5.00 | 7.00 | ▇▂▅▂▅ |
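The complete_rate column in the skim tables above is simply the share of non-missing values, 1 - n_missing/n. A minimal base-R sketch on a toy vector (hypothetical values, not the actual yrbss column):

```r
# Toy vector standing in for a yrbss column (hypothetical values)
height <- c(1.73, NA, 1.60, 1.50, NA, 1.65)

n_missing <- sum(is.na(height))                  # NAs in the column
complete_rate <- 1 - n_missing / length(height)  # share of non-missing values
```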
Employing the skim and glimpse functions above, we observe that the height and weight variables each have 1,004 missing values (several other variables have missing values as well). Rows containing missing values are filtered out, and the data processed in the following lines:
new_yrbss <- drop_na(yrbss) # Eliminate missing values in the data
weight_data <- new_yrbss %>%
select(weight) %>% # Construct the summary statistics table
summarise(mean_weight= mean(weight, na.rm= TRUE),
median_weight= median(weight, na.rm= TRUE),
std_weight= sd(weight,na.rm= TRUE),
count = n(),
t_critical = qt(0.975, count-1),
se_weight = std_weight/sqrt(count),
margin_of_error = t_critical * se_weight,
CI_weight_low = mean_weight - margin_of_error,
CI_weight_high = mean_weight + margin_of_error)
print.data.frame(head(weight_data)) # Present the summary statistics
## mean_weight median_weight std_weight count t_critical se_weight
## 1 68.2 65.8 16.9 8351 1.96 0.185
## margin_of_error CI_weight_low CI_weight_high
## 1 0.364 67.8 68.6
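The interval above follows the standard t-based formula, mean plus or minus t* times s/sqrt(n). A self-contained base-R sketch on a toy weight sample (assumed values, not the survey data):

```r
# Toy weight sample in kg (assumed values, not the survey data)
x <- c(55.8, 46.7, 67.1, 71.2, 84.4, 64.0, 58.3, 72.5)

n  <- length(x)
m  <- mean(x)
se <- sd(x) / sqrt(n)                 # standard error of the mean
t_critical <- qt(0.975, df = n - 1)   # two-sided 95% critical value
ci <- c(m - t_critical * se, m + t_critical * se)
```

The result agrees with t.test(x)$conf.int, which computes the same interval directly.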
ggplot(new_yrbss, aes(x=weight))+
geom_density(alpha=0.2) +
theme_bw() +
labs (
title = "Density Plot for Weight",
y = "Density")
When we use visualization and look at the graph, we can tell that the data is positively skewed. We could have anticipated this from our summary statistics, because the mean is greater than the median.
We can then continue to process the data to produce tibbles more conducive to plotting boxplots of schoolchildren's weight versus activity level:
ggplot(new_yrbss, aes(y= weight, x= physically_active_7d))+
facet_wrap(vars(physically_active_7d)) +
geom_boxplot() +
theme_bw() +
labs (
title = "Weight vs. Physical Activity",
x= "Days of Activity") +
NULL
school_weight_yrbss <- new_yrbss %>% # Creating separate table for those
# active 3+ days
mutate(physical_3plus = ifelse(physically_active_7d >=3, "YES", "NO")) %>%
count(physical_3plus) %>%
pivot_wider(names_from = physical_3plus,
values_from = n )
new_school_weight_yrbss <- school_weight_yrbss %>%
mutate(total = YES + NO ,
yes_proportion = YES / total,
no_proportion = 1 - yes_proportion)
print.data.frame(head(new_school_weight_yrbss))
## NO YES total yes_proportion no_proportion
## 1 2656 5695 8351 0.682 0.318
Using this data, we will create a 95% confidence interval for the proportion of schoolchildren who are not physically active for at least 3 days a week:
new_school_weight_yrbss %>%
mutate(se = sqrt(no_proportion * (1-no_proportion)/ total), # Standard error
lower= no_proportion - 1.96*se,
upper= no_proportion + 1.96*se
)
## # A tibble: 1 × 8
## NO YES total yes_proportion no_proportion se lower upper
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2656 5695 8351 0.682 0.318 0.00510 0.308 0.328
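The standard error reported above can be verified by hand from the printed counts, since se = sqrt(p(1-p)/n) for a proportion:

```r
# Counts taken from the table printed above
n_no  <- 2656
n     <- 8351
p_hat <- n_no / n                       # proportion not active 3+ days
se    <- sqrt(p_hat * (1 - p_hat) / n)  # standard error of a proportion
ci    <- p_hat + c(-1.96, 1.96) * se    # normal-approximation 95% CI
```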
Through the boxplot, we observe a relationship between the two variables. Focusing on the boxes alone, those who engaged in physical activity for 0-2 days range from about 55-75 kg, while those who exercised for at least 3 days weigh about 65-78 kg. This would suggest that those who exercise less actually weigh less, although this does not take the whiskers into account. Though our original expectation was that those who exercised less would weigh more, the data presented can be understood as reflecting the relationship between regular exercise and higher levels of muscle mass.
box_plot_data <- new_yrbss %>%
mutate(physical_3plus= ifelse(physically_active_7d >=3, "YES", "NO"))
# Creating a new column for the desired data
ggplot(box_plot_data, aes(y= weight, x= physical_3plus))+
geom_boxplot() +
theme_bw() +
labs (
title = "Weight vs. Physical Activity",
x= "Days of Activity") +
NULL
Boxplots show how the medians of the two distributions compare, but we will also compare the means of the distributions using either a confidence interval or a hypothesis test.
yes_no_weight_data <- box_plot_data %>%
group_by(physical_3plus) %>%
summarise(mean_weight = mean(weight),
median_weight = median(weight),
std_weight = sd(weight),
count = n(),
t_critical = qt(0.975, count-1),
se_weight = std_weight/sqrt(count),
margin_of_error = t_critical * se_weight,
CI_weight_low = mean_weight - margin_of_error,
CI_weight_high = mean_weight + margin_of_error)
print.data.frame(head(yes_no_weight_data)) # Printing the summary statistics
## physical_3plus mean_weight median_weight std_weight count t_critical
## 1 NO 67.1 62.6 18.0 2656 1.96
## 2 YES 68.7 65.8 16.4 5695 1.96
## se_weight margin_of_error CI_weight_low CI_weight_high
## 1 0.349 0.684 66.5 67.8
## 2 0.218 0.426 68.3 69.1
We can express our null hypothesis (H0) and our alternative (H1) as:
Null: mean_weight(YES) = mean_weight(NO)
Alternative: mean_weight(YES) != mean_weight(NO)
##
## Welch Two Sample t-test
##
## data: weight by physical_3plus
## t = -4, df = 4781, p-value = 2e-04
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.334 -0.721
## sample estimates:
## mean in group NO mean in group YES
## 67.1 68.7
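The Welch test reported above does not assume equal variances in the two groups; it is R's default for t.test(). A self-contained sketch on simulated groups (assumed means and spreads, not the yrbss data):

```r
set.seed(1)
# Simulated weights for the two activity groups (assumed parameters)
no_group  <- rnorm(100, mean = 67, sd = 18)
yes_group <- rnorm(100, mean = 69, sd = 16)

res <- t.test(yes_group, no_group)  # Welch (unequal variances) is the default
res$p.value
res$conf.int
```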
Running the same test using the specialised infer package:
obs_diff <- box_plot_data %>%
specify(weight ~ physical_3plus) %>%
calculate(stat = "diff in means", order = c("YES", "NO")) # infer package
print.data.frame(head(obs_diff))
## stat
## 1 1.53
null_dist <- box_plot_data %>%
# specify variables
specify(weight ~ physical_3plus) %>%
# assume independence, i.e, there is no difference
hypothesize(null = "independence") %>%
# generate 1000 reps, of type "permute"
generate(reps = 1000, type = "permute") %>%
# calculate statistic of difference, namely "diff in means"
calculate(stat = "diff in means", order = c("YES", "NO"))
We can visualize this null distribution with the following code:
And expand on it using the following code:
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
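The infer pipeline above is equivalent to a manual permutation test: shuffle the group labels, recompute the difference in means each time, and see how often the shuffled differences are at least as extreme as the observed one. A base-R sketch on toy data:

```r
set.seed(42)
# Toy data: weights and activity-group labels (not the yrbss sample)
weight <- c(rnorm(50, mean = 68, sd = 16), rnorm(50, mean = 72, sd = 16))
group  <- rep(c("NO", "YES"), each = 50)

diff_means <- function(w, g) mean(w[g == "YES"]) - mean(w[g == "NO"])
obs <- diff_means(weight, group)

# Permute the labels 1000 times to build the null distribution
null_dist <- replicate(1000, diff_means(weight, sample(group)))
p_value <- mean(abs(null_dist) >= abs(obs))  # two-sided permutation p-value
```

A reported p-value of 0, as above, simply means that none of the 1000 permuted differences was as extreme as the observed one.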
In this section, we will reproduce the above graph, and perform hypothesis tests on the mean ratings of the films of Steven Spielberg and Tim Burton. Our H0 and H1 are outlined below:
Null: mean_rating(Steven Spielberg) = mean_rating(Tim Burton)
Alternative: mean_rating(Steven Spielberg) != mean_rating(Tim Burton)
Test statistic: t = 3 (|t| > 2, so we reject the null; the true mean ratings do differ)
## Rows: 2,961
## Columns: 11
## $ title <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, …
## $ budget <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, …
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …
updated_movie <- movies %>%
filter(director== "Steven Spielberg" | director== "Tim Burton" )
# Filtered the movies data set into a smaller, more focused dataframe
new_movie <- updated_movie %>%
group_by(director) %>%
summarise (mean_rating= mean(rating),
median_rating= median(rating),
std_rating= sd(rating),
count = n(),
t_critical = qt(0.975, count-1),
se_rating = std_rating/sqrt(count),
margin_of_error = t_critical * se_rating,
CI_rating_low = mean_rating - margin_of_error,
CI_rating_high = mean_rating + margin_of_error)
t.test(rating ~ director, data = updated_movie)
##
## Welch Two Sample t-test
##
## data: rating by director
## t = 3, df = 31, p-value = 0.01
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.16 1.13
## sample estimates:
## mean in group Steven Spielberg mean in group Tim Burton
## 7.57 6.93
movies_dist <- updated_movie %>%
specify(rating ~ director) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("Steven Spielberg", "Tim Burton"))
obs_diff_movie <- updated_movie %>%
specify(rating ~ director) %>%
calculate(stat = "diff in means", order = c("Steven Spielberg", "Tim Burton"))
movies_dist %>% visualize() +
shade_p_value(obs_stat = obs_diff_movie, direction = "two-sided")
## director mean_rating median_rating std_rating count t_critical
## 1 Steven Spielberg 7.57 7.6 0.695 23 2.07
## 2 Tim Burton 6.93 7.0 0.749 16 2.13
## se_rating margin_of_error CI_rating_low CI_rating_high
## 1 0.145 0.301 7.27 7.87
## 2 0.187 0.399 6.53 7.33
round_df <- function(x, digits) {
numeric_columns <- sapply(x, mode) == 'numeric'
x[numeric_columns] <- round(x[numeric_columns], digits)
x}
rounded_movie <- round_df(new_movie, 2)
#Reproduce the graph
ggplot(rounded_movie, aes(x= mean_rating, y= director, color= director)) +
geom_point(size= 3) +
geom_errorbar(aes(xmin= CI_rating_low, xmax= CI_rating_high, color= director),
width=0.1, size=1) +
geom_rect(aes(xmax= 7.33, xmin= 7.27, ymin=0, ymax= 3, color= "grey") ,
alpha= 0.3) +
theme_bw() +
labs( title= "Do Spielberg and Burton have the same mean IMDB ratings?",
subtitle= "95% confidence intervals overlap",
x= "Mean IMDB rating",
y=" Directors") +
scale_color_manual(values = c("grey", "brown2", "turquoise3")) +
scale_y_discrete(limits = c("Tim Burton","Steven Spielberg")) +
theme(legend.position = "none") +
geom_text_repel(aes(label= mean_rating), color="black", size=6) +
geom_text_repel(aes(label= CI_rating_low),
position = position_nudge(x = -0.35), color="black") +
geom_text_repel(aes(label= CI_rating_high),
position = position_nudge(x = 0.3), color="black")
Ultimately, we can reject the null hypothesis and show that, at a 95% confidence level, the mean rating for Steven Spielberg films is higher than that of Tim Burton.
At the last board meeting of Omega Group Plc., a large multinational company, the issue was raised that women were being discriminated against, in the sense that salaries were not the same for male and female executives. A quick analysis of a sample of 50 employees (24 men and 26 women) revealed that the average salary for men was about 8,700 higher than for women. This seemed like a considerable difference, so it was decided that further analysis of the company's salaries was warranted.
You are asked to carry out the analysis. The objective is to find out whether there is indeed a significant difference between the salaries of men and women, and whether the difference is due to discrimination or whether it is based on another, possibly valid, determining factor.
## Rows: 50
## Columns: 3
## $ salary <dbl> 81894, 69517, 68589, 74881, 65598, 76840, 78800, 70033, 635…
## $ gender <chr> "male", "male", "male", "male", "male", "male", "male", "ma…
## $ experience <dbl> 16, 25, 15, 33, 16, 19, 32, 34, 1, 44, 7, 14, 33, 19, 24, 3…
The data frame omega contains the salaries for the sample of 50 executives in the company. We will analyse the dataset and investigate to see if there is a significant difference between the salaries of the male and female executives.
Null: mean_salary(female) = mean_salary(male)
Alternative: mean_salary(female) != mean_salary(male)
## gender min Q1 median Q3 max mean sd n missing
## 1 female 47033 60338 64618 70033 78800 64543 7567 26 0
## 2 male 54768 68331 74675 78568 84576 73239 7463 24 0
omega_updated <- omega %>%
group_by(gender) %>%
summarise(mean_salary= mean(salary),
median_salary= median(salary),
std_salary= sd(salary),
count = n(),
t_critical = qt(0.975, count-1),
se_salary = std_salary/sqrt(count),
margin_of_error = t_critical * se_salary,
CI_salary_low = mean_salary - margin_of_error,
CI_salary_high = mean_salary + margin_of_error)
print.data.frame(head(omega_updated))
## gender mean_salary median_salary std_salary count t_critical se_salary
## 1 female 64543 64618 7567 26 2.06 1484
## 2 male 73239 74675 7463 24 2.07 1523
## margin_of_error CI_salary_low CI_salary_high
## 1 3056 61486 67599
## 2 3151 70088 76390
We can conclude that men are paid more than women at Omega Group: the confidence intervals for men and women do not overlap, which means the true mean salaries cannot be the same.
Even though these differences could be due to factors other than gender, there is still reason for concern, and the issue should be analysed further.
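The non-overlap argument can be made explicit: two intervals [a1, b1] and [a2, b2] overlap iff a1 <= b2 and a2 <= b1. Checking this with the CI bounds printed above:

```r
# CI bounds copied from the summary table above
female_ci <- c(61486, 67599)
male_ci   <- c(70088, 76390)

# Intervals [a1, b1] and [a2, b2] overlap iff a1 <= b2 and a2 <= b1
overlap <- female_ci[1] <= male_ci[2] && male_ci[1] <= female_ci[2]
overlap  # FALSE: the intervals are disjoint
```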
##
## Welch Two Sample t-test
##
## data: salary by gender
## t = -4, df = 48, p-value = 2e-04
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -12973 -4420
## sample estimates:
## mean in group female mean in group male
## 64543 73239
# hypothesis testing using infer package
omega_dist <- omega %>%
specify(salary ~ gender) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("female", "male"))
obs_diff_gender <- omega %>%
specify(salary ~ gender) %>%
calculate(stat = "diff in means", order = c("female", "male"))
omega_dist %>% visualize() +
shade_p_value(obs_stat = obs_diff_gender, direction = "two-sided")
Some may raise the issue that there was indeed a substantial difference between male and female salaries, but that this was attributable to other reasons such as differences in experience. A questionnaire sent out to the 50 executives in the sample reveals that the average experience of the men is approximately 21 years, whereas the women have only about 7 years of experience on average (see table below).
## gender min Q1 median Q3 max mean sd n missing
## 1 female 0 0.25 3.0 14.0 29 7.38 8.51 26 0
## 2 male 1 15.75 19.5 31.2 44 21.12 10.92 24 0
Null: mean_experience(male) = mean_experience(female)
Alternative: mean_experience(male) != mean_experience(female)
We derived the t-value and p-value with the code and results below:
experience_gender <- omega %>%
group_by(gender) %>%
summarise(mean_experience= mean(experience),
median_experience= median(experience),
std_experience= sd(experience),
count = n(),
t_critical = qt(0.975, count-1),
se_experience = std_experience/sqrt(count),
margin_of_error = t_critical * se_experience,
CI_experience_low = mean_experience - margin_of_error,
CI_experience_high = mean_experience + margin_of_error)
print.data.frame(head(experience_gender))
## gender mean_experience median_experience std_experience count t_critical
## 1 female 7.38 3.0 8.51 26 2.06
## 2 male 21.12 19.5 10.92 24 2.07
## se_experience margin_of_error CI_experience_low CI_experience_high
## 1 1.67 3.44 3.95 10.8
## 2 2.23 4.61 16.52 25.7
##
## Welch Two Sample t-test
##
## data: experience by gender
## t = -5, df = 43, p-value = 1e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -19.35 -8.13
## sample estimates:
## mean in group female mean in group male
## 7.38 21.12
# hypothesis testing using infer package
omega_dist_experience <- omega %>%
specify(experience ~ gender) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("female", "male"))
obs_diff_experience <- omega %>%
specify(experience ~ gender) %>%
calculate(stat = "diff in means", order = c("female", "male"))
omega_dist_experience %>% visualize() +
shade_p_value(obs_stat = obs_diff_experience, direction = "two-sided")
Since the t-value is -5 and the p-value < 0.05, we reject the null. So, yes, there is a significant difference between the average experience of the male and female executives.
Someone at the meeting might argue that, clearly, a more thorough analysis of the relationship between salary and experience is required before any conclusion can be drawn about whether there is any gender-based salary discrimination in the company.
ggplot(omega, aes(x= experience, y=salary)) +
geom_point() +
geom_smooth() +
theme_bw() +
labs(title= "Salary vs. Experience",
x="Experience",
y= "Salary") +
NULL
experience_salary <- omega %>%
group_by(experience) %>%
summarise(mean_salary= mean(salary), # Summary statistics
median_salary= median(salary),
std_salary= sd(salary),
count = n(),
t_critical = qt(0.975, count-1),
se_salary = std_salary/sqrt(count),
margin_of_error = t_critical * se_salary,
CI_experience_low = mean_salary - margin_of_error,
CI_experience_high = mean_salary + margin_of_error)
print.data.frame(head(experience_salary))
## experience mean_salary median_salary std_salary count t_critical se_salary
## 1 0 55461 55490 5111 7 2.45 1932
## 2 1 63174 63174 511 2 12.71 361
## 3 2 62839 63948 2783 4 3.18 1392
## 4 3 61816 63948 6261 3 4.30 3615
## 5 4 66114 66114 NA 1 NaN NA
## 6 5 62194 62194 NA 1 NaN NA
## margin_of_error CI_experience_low CI_experience_high
## 1 4727 50734 60187
## 2 4587 58587 67761
## 3 4429 58410 67268
## 4 15552 46264 77368
## 5 NaN NaN NaN
## 6 NaN NaN NaN
It can be seen that there is indeed a positive correlation between salary and experience. However, the significance of this statement faces several challenges and requires further analysis.
Firstly, as can be seen in the scatter plot, the 95% confidence interval is shaded in grey. Many of the points lie outside this interval, which challenges the validity of the result at the 95% confidence level. A more in-depth hypothesis test is needed to establish the significance of this result.
Secondly, the graph shows that when experience exceeds around 30 years, there may be a negative relationship between the two variables. This can reasonably be explained by retirement or age (a high age may be seen as a negative factor when deciding salary in certain industries). The extent to which this relationship holds should also be taken into consideration.
Thirdly, even if the positive relationship survives the two challenges above, it is unclear whether salary increases with experience, or experience increases with salary. It makes sense that more experienced workers earn higher salaries due to higher efficiency or productivity, but it also makes sense that people tend to stay longer in an industry if the salary is high enough.
Therefore, we can conclude that the graph above shows a general positive correlation between salary and experience, but the validity of this relationship requires more in-depth analysis.
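The strength and direction of that relationship can be quantified with a correlation coefficient before fitting anything. A base-R sketch on toy experience/salary pairs (assumed numbers, not the omega sample):

```r
# Toy experience/salary pairs (assumed numbers, not the omega sample)
experience <- c(0, 2, 5, 10, 16, 25, 33, 44)
salary     <- c(47000, 55000, 60000, 64000, 70000, 76000, 79000, 78000)

cor(experience, salary)                       # Pearson correlation
cor(experience, salary, method = "spearman")  # rank-based alternative
```

A Spearman correlation is worth checking alongside Pearson here, since it is less sensitive to the non-linearity at high experience noted above.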
We can use GGally::ggpairs() to create a scatterplot and correlation matrix. Essentially, we change the order in which our variables appear so that the dependent variable (Y), salary, is last in our list. We then pipe the dataframe to ggpairs() with aes arguments to colour by gender and make the plots somewhat transparent (alpha = 0.3).
omega %>%
select(gender, experience, salary) %>% #order variables they will appear
# in ggpairs()
ggpairs(aes(colour=gender, alpha = 0.3))+
theme_bw()
The scatterplot shows that there is a somewhat positive relationship between salary and experience for both males and females. However, this positive relationship is more obvious for males.
On the contrary, for females, the positive relationship does not appear to be as strong. There are a few datapoints where a couple of females have 0 work experience, but their salaries are rather varied. The salary range is rather large, being nearly 60,000. This may suggest that factors other than work experience contribute to the determination of salary for females, especially when they have no experience at all.
First, we will load the yield curve data file, which contains data on the yield curve since 1960-01-01.
yield_curve <- read_csv(here::here("data", "yield_curve.csv"))
glimpse(yield_curve) # Examining loaded data for NAs etc.
## Rows: 6,884
## Columns: 5
## $ date <date> 1960-01-01, 1960-02-01, 1960-03-01, 1960-04-01, 1960-05-01,…
## $ series_id <chr> "TB3MS", "TB3MS", "TB3MS", "TB3MS", "TB3MS", "TB3MS", "TB3MS…
## $ value <dbl> 4.35, 3.96, 3.31, 3.23, 3.29, 2.46, 2.30, 2.30, 2.48, 2.30, …
## $ maturity <chr> "3m", "3m", "3m", "3m", "3m", "3m", "3m", "3m", "3m", "3m", …
## $ duration <chr> "3-Month Treasury Bill", "3-Month Treasury Bill", "3-Month T…
Our dataframe yield_curve has five columns (variables):
- date: already a date object
- series_id: the FRED database ticker symbol
- value: the actual yield on that date
- maturity: a short hand for the maturity of the bond
- duration: the duration, written out in all its glory!

new_yield_curve <- yield_curve
new_yield_curve$duration <- factor(new_yield_curve$duration,
levels= c("3-Month Treasury Bill", "6-Month Treasury Bill",
"1-Year Treasury Rate", "2-Year Treasury Rate",
"3-Year Treasury Rate", "5-Year Treasury Rate",
"7-Year Treasury Rate", "10-Year Treasury Rate",
"20-Year Treasury Rate", "30-Year Treasury Rate") )
ggplot(new_yield_curve, aes(x = date, y = value, color = duration)) +
geom_line() +
facet_wrap(vars(duration), ncol=2) +
theme_bw()+
theme(legend.position = "none") +
labs(title= "Yields on U.S. Treasury rates since 1960",
y= "%")
library(lubridate)
monthly_yield <- new_yield_curve %>%
mutate(year= year(date),
month= month(date)) %>%
filter(year>=1999) %>%
group_by(year, month)
print.data.frame(head(monthly_yield))
## date series_id value maturity duration year month
## 1 1999-01-01 TB3MS 4.34 3m 3-Month Treasury Bill 1999 1
## 2 1999-02-01 TB3MS 4.44 3m 3-Month Treasury Bill 1999 2
## 3 1999-03-01 TB3MS 4.44 3m 3-Month Treasury Bill 1999 3
## 4 1999-04-01 TB3MS 4.29 3m 3-Month Treasury Bill 1999 4
## 5 1999-05-01 TB3MS 4.50 3m 3-Month Treasury Bill 1999 5
## 6 1999-06-01 TB3MS 4.57 3m 3-Month Treasury Bill 1999 6
ggplot(monthly_yield, aes(x= maturity, y= value, group = month, color=year)) +
geom_line() +
facet_wrap(vars(year), ncol= 4)+
theme(legend.position = "none") +
scale_x_discrete(limits = c("3m","6m", "1y","2y", "3y", "5y", "7y", "10y",
"20y", "30y")) +
labs(title= "U.S. Yield Curve",
y= "Yield (%)") +
NULL
three_ten_yield <- monthly_yield %>%
filter(duration == "3-Month Treasury Bill" |
duration == "10-Year Treasury Rate")
print.data.frame(head(three_ten_yield))
## date series_id value maturity duration year month
## 1 1999-01-01 TB3MS 4.34 3m 3-Month Treasury Bill 1999 1
## 2 1999-02-01 TB3MS 4.44 3m 3-Month Treasury Bill 1999 2
## 3 1999-03-01 TB3MS 4.44 3m 3-Month Treasury Bill 1999 3
## 4 1999-04-01 TB3MS 4.29 3m 3-Month Treasury Bill 1999 4
## 5 1999-05-01 TB3MS 4.50 3m 3-Month Treasury Bill 1999 5
## 6 1999-06-01 TB3MS 4.57 3m 3-Month Treasury Bill 1999 6
ggplot(three_ten_yield, aes(x=date, y= value, color=duration))+
geom_line()+
theme_bw() +
labs(title= "Yields on 3-month and 10-year US Treasury rates since 1999",
y= "%") +
NULL
According to Wikipedia's list of recessions in the United States, since 1999 there have been two recessions in the US: between Mar 2001 and Nov 2001, and between Dec 2007 and June 2009. However, the plots here seem to show that the drastic changes in yields occur after the given start date of each recession, and that there are many other periods of yield-curve flattening that are not attached to any recession.
# get US recession dates after 1946 from Wikipedia
# https://en.wikipedia.org/wiki/List_of_recessions_in_the_United_States
recessions <- tibble(
from = c("1948-11-01", "1953-07-01", "1957-08-01", "1960-04-01",
"1969-12-01", "1973-11-01", "1980-01-01","1981-07-01", "1990-07-01",
"2001-03-01", "2007-12-01","2020-02-01"),
to = c("1949-10-01", "1954-05-01", "1958-04-01", "1961-02-01", "1970-11-01",
"1975-03-01", "1980-07-01", "1982-11-01", "1991-03-01", "2001-11-01",
"2009-06-01", "2020-04-30")
) %>%
mutate(From = ymd(from),
To=ymd(to),
duration_days = To-From)
print.data.frame(head(recessions))
## from to From To duration_days
## 1 1948-11-01 1949-10-01 1948-11-01 1949-10-01 334 days
## 2 1953-07-01 1954-05-01 1953-07-01 1954-05-01 304 days
## 3 1957-08-01 1958-04-01 1957-08-01 1958-04-01 243 days
## 4 1960-04-01 1961-02-01 1960-04-01 1961-02-01 306 days
## 5 1969-12-01 1970-11-01 1969-12-01 1970-11-01 335 days
## 6 1973-11-01 1975-03-01 1973-11-01 1975-03-01 485 days
new_three_ten_yield <- yield_curve %>%
filter(duration == "3-Month Treasury Bill" |
duration == "10-Year Treasury Rate")
updated_three_ten_yield <- new_three_ten_yield %>%
select(date, duration, value) %>%
pivot_wider(names_from=duration,
values_from=value) %>%
janitor::clean_names() %>%
mutate(difference= x10_year_treasury_rate - x3_month_treasury_bill)
updated_recessions <- recessions %>%
filter(year(From) >=1960)
#merged dataframe (updated_recessions & updated_three_ten_yield)
new_data <- merge(updated_three_ten_yield, updated_recessions)
ggplot(new_data, aes(x = date, y = difference))+
geom_line() +
geom_line(aes(y=0), color= "grey3") +
geom_rect(aes(xmin= From, xmax=To, ymin=-3, ymax=5), fill="gray84") +
geom_ribbon(aes(ymin=pmin(difference,0), ymax=0), fill="indianred3",
alpha=0.5) +
geom_ribbon(aes(ymin=0, ymax=pmax(difference,0)), fill="skyblue3",
alpha=0.5) +
geom_rug(data = subset(updated_three_ten_yield, difference < 0 ),
color="indianred3", sides="b") +
geom_rug(data = subset(updated_three_ten_yield, difference >= 0 ),
color="skyblue3", sides="b") +
theme(legend.position="none") +
theme_minimal() +
labs (
title = "Yield Curve Inversion: 10-year minus 3-month U.S.Treasury rates",
x = "",
y = "Difference (10 year - 3 month) yield in %",
)
The above plot shows that recessions can also precede inversion, so the link may not be as strong as suggested.
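The inversion signal discussed above reduces to flagging months where the 10-year minus 3-month spread turns negative. A base-R sketch on toy yields (assumed numbers, not the FRED series):

```r
# Toy monthly yields in percent (not the FRED series)
ten_year <- c(4.7, 4.5, 4.3, 4.1, 4.0)
three_mo <- c(4.2, 4.4, 4.5, 4.3, 3.8)

spread   <- ten_year - three_mo  # 10-year minus 3-month
inverted <- spread < 0           # TRUE in months where the curve is inverted
```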
At the risk of oversimplifying things, the main components of gross domestic product (GDP) are personal consumption (C), business investment (I), government spending (G) and net exports (exports - imports).
The GDP data we will look at is from the United Nations’ National Accounts Main Aggregates Database, which contains estimates of total GDP and its components for all countries from 1970 to today. We will look at how GDP and its components have changed over time, and compare different countries and how much each component contributes to that country’s GDP.
UN_GDP_data <- read_excel(here::here("data", "Download-GDPconstant-USD-countries.xls"), # Excel filename
sheet="Download-GDPconstant-USD-countr", # Sheet name
skip=2) # Number of rows to skip
print.data.frame(head(UN_GDP_data))
## CountryID Country
## 1 4 Afghanistan
## 2 4 Afghanistan
## 3 4 Afghanistan
## 4 4 Afghanistan
## 5 4 Afghanistan
## 6 4 Afghanistan
## IndicatorName
## 1 Final consumption expenditure
## 2 Household consumption expenditure (including Non-profit institutions serving households)
## 3 General government final consumption expenditure
## 4 Gross capital formation
## 5 Gross fixed capital formation (including Acquisitions less disposals of valuables)
## 6 Exports of goods and services
## 1970 1971 1972 1973 1974 1975 1976 1977
## 1 5.56e+09 5.33e+09 5.20e+09 5.75e+09 6.15e+09 6.32e+09 6.37e+09 6.90e+09
## 2 5.07e+09 4.84e+09 4.70e+09 5.21e+09 5.59e+09 5.65e+09 5.68e+09 6.15e+09
## 3 3.72e+08 3.82e+08 4.02e+08 4.21e+08 4.31e+08 5.98e+08 6.27e+08 6.76e+08
## 4 9.85e+08 1.05e+09 9.19e+08 9.19e+08 1.18e+09 1.37e+09 2.03e+09 1.92e+09
## 5 9.85e+08 1.05e+09 9.19e+08 9.19e+08 1.18e+09 1.37e+09 2.03e+09 1.92e+09
## 6 1.12e+08 1.45e+08 1.73e+08 2.18e+08 3.00e+08 3.16e+08 4.17e+08 4.31e+08
## 1978 1979 1980 1981 1982 1983 1984 1985
## 1 7.09e+09 6.92e+09 6.69e+09 6.81e+09 6.96e+09 7.30e+09 7.43e+09 7.45e+09
## 2 6.30e+09 6.15e+09 5.95e+09 6.06e+09 6.19e+09 6.49e+09 6.61e+09 6.63e+09
## 3 7.25e+08 7.08e+08 6.85e+08 6.97e+08 7.12e+08 7.47e+08 7.60e+08 7.63e+08
## (head of UN_GDP_data, printed in wide format: rows 1-6 repeat across one column per year, 1970-2017; output truncated)
The first thing we need to do is tidy the data: it is in wide format, so we pivot it into long format, express all figures in billions (divide values by 1e9, or \(10^9\)), and rename the indicators to something shorter.
## [1] 51
tidy_GDP_data <- UN_GDP_data %>%
pivot_longer(cols = 4:51, #columns 4 to 51
names_to = "Year",
values_to = "Country_GDP") %>%
mutate(Country_GDP = Country_GDP/1e9)
print.data.frame(head(tidy_GDP_data))
## CountryID Country IndicatorName Year Country_GDP
## 1 4 Afghanistan Final consumption expenditure 1970 5.56
## 2 4 Afghanistan Final consumption expenditure 1971 5.33
## 3 4 Afghanistan Final consumption expenditure 1972 5.20
## 4 4 Afghanistan Final consumption expenditure 1973 5.75
## 5 4 Afghanistan Final consumption expenditure 1974 6.15
## 6 4 Afghanistan Final consumption expenditure 1975 6.32
country_list <- c("United States","India", "Germany") #Compare GDP components for these 3 countries
unique(tidy_GDP_data[c("IndicatorName")])
## # A tibble: 17 × 1
## IndicatorName
## <chr>
## 1 Final consumption expenditure
## 2 Household consumption expenditure (including Non-profit institutions serving…
## 3 General government final consumption expenditure
## 4 Gross capital formation
## 5 Gross fixed capital formation (including Acquisitions less disposals of valu…
## 6 Exports of goods and services
## 7 Imports of goods and services
## 8 Gross Domestic Product (GDP)
## 9 Agriculture, hunting, forestry, fishing (ISIC A-B)
## 10 Mining, Manufacturing, Utilities (ISIC C-E)
## 11 Manufacturing (ISIC D)
## 12 Construction (ISIC F)
## 13 Wholesale, retail trade, restaurants and hotels (ISIC G-H)
## 14 Transport, storage and communication (ISIC I)
## 15 Other Activities (ISIC J-P)
## 16 Total Value Added
## 17 Changes in inventories
First, we will reproduce this plot:
raw_GDP_component_list <- c("Gross capital formation", "Exports of goods and services", "General government final consumption expenditure", "Household consumption expenditure (including Non-profit institutions serving households)", "Imports of goods and services", "Gross Domestic Product (GDP)")
GDP_component_list <- c("Gross capital formation", "Exports", "Government expenditure", "Household expenditure","Imports","GDP")
three_countries_data <- tidy_GDP_data %>%
filter(Country %in% country_list) %>%
mutate(IndicatorName = factor(IndicatorName,
levels = raw_GDP_component_list,
labels = GDP_component_list)) %>%
filter(IndicatorName %in% GDP_component_list)
three_countries_data_no_GDP <- three_countries_data %>%
filter(IndicatorName != "GDP")
#head(three_countries_data_no_GDP)
plot_three_countries <- ggplot(three_countries_data_no_GDP, aes(x = Year,
y = Country_GDP,
group = IndicatorName,
colour = IndicatorName))+
scale_x_discrete(breaks = seq(1970, 2020, by = 10)) +
geom_line(size = 1) +
facet_wrap(vars(Country), scales = "free_x") +
theme_bw()+
labs (
title = "GDP components over time",
subtitle = "In constant 2010 USD",
x = "",
y = "Billion US$" ) +
guides(color=guide_legend(title="Components of GDP")) +
NULL
plot_three_countries
GDP_component_cleaned_list <- c("Gross_capital_formation", "Exports", "Government_expenditure", "Household_expenditure", "Imports", "GDP")
original_GDP_data <- three_countries_data %>%
mutate(IndicatorName = factor(IndicatorName,
levels = GDP_component_list,
labels = GDP_component_cleaned_list)) %>%
pivot_wider(names_from = IndicatorName,
values_from = Country_GDP) %>%
mutate(GDP_calculated = Household_expenditure + Government_expenditure + Gross_capital_formation + Exports - Imports,
#GDP_proportion = abs(GDP_calculated - GDP)/GDP,
Household_expenditure_proportion = Household_expenditure / GDP,
Government_expenditure_proportion = Government_expenditure / GDP,
Gross_capital_formation_proportion = Gross_capital_formation / GDP,
Net_Exports_proportion = (Exports - Imports) / GDP)
print.data.frame(head(original_GDP_data))
## CountryID Country Year Household_expenditure Government_expenditure
## 1 276 Germany 1970 872 280
## 2 276 Germany 1971 919 298
## 3 276 Germany 1972 969 313
## 4 276 Germany 1973 997 332
## 5 276 Germany 1974 995 351
## 6 276 Germany 1975 1032 366
## Gross_capital_formation Exports Imports GDP GDP_calculated
## 1 442 174 188 1534 1581
## 2 444 178 202 1582 1638
## 3 452 189 215 1650 1709
## 4 464 210 222 1729 1779
## 5 419 234 223 1744 1775
## 6 392 220 230 1729 1780
## Household_expenditure_proportion Government_expenditure_proportion
## 1 0.568 0.183
## 2 0.581 0.188
## 3 0.587 0.190
## 4 0.576 0.192
## 5 0.570 0.201
## 6 0.597 0.212
## Gross_capital_formation_proportion Net_Exports_proportion
## 1 0.288 -0.00883
## 2 0.280 -0.01471
## 3 0.274 -0.01534
## 4 0.268 -0.00743
## 5 0.240 0.00623
## 6 0.226 -0.00580
Secondly, recall that GDP is the sum of Household Expenditure (Consumption C), Gross Capital Formation (business investment I), Government Expenditure (G) and Net Exports (exports - imports). Even though there is an indicator Gross Domestic Product (GDP) in your dataframe, I would like you to calculate it given its components discussed above.
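The identity GDP = C + I + G + (X - M) can be sanity-checked by hand against the Germany 1970 figures printed above. A minimal sketch (the values below are the rounded billions from the printed table, so the sum differs slightly from the table's 1581):

```r
# Germany 1970 components, in billions of constant 2010 US$
# (rounded values copied from the printed table above)
C <- 872   # Household expenditure (consumption)
I <- 442   # Gross capital formation (investment)
G <- 280   # Government expenditure
X <- 174   # Exports
M <- 188   # Imports

GDP_calculated <- C + I + G + X - M
GDP_calculated  # 1580, close to the table's 1581; the gap is rounding
```

The rounded components reproduce the table's GDP_calculated to within one billion, which is what we expect when summing values already rounded for display.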
What is the % difference between what you calculated as GDP and the GDP figure included in the dataframe?
#Find difference between actual and calculated GDP
diff_GDP_data <- three_countries_data %>%
mutate(IndicatorName = factor(IndicatorName,
levels = GDP_component_list,
labels = GDP_component_cleaned_list)) %>%
pivot_wider(names_from = IndicatorName,
values_from = Country_GDP) %>%
mutate(GDP_calculated = Household_expenditure + Government_expenditure + Gross_capital_formation + Exports - Imports,
GDP_diff_proportion = abs(GDP_calculated - GDP)/GDP)
plot_diff<- ggplot(diff_GDP_data, aes(x = Year, y = GDP_diff_proportion, group = Country, colour = Country)) +
scale_x_discrete(breaks = seq(1970, 2020, by = 10)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
geom_line(size = 0.9) +
theme_bw()+
labs (
title = "% difference between what calculated as GDP and the GDP figure",
x = "",
y = "proportion" ) +
guides(color=guide_legend(title="Country")) +
NULL
plot_diff
India shows a sizeable gap (around 2% in the most recent year) while the other two countries track their reported GDP closely. One possible explanation is that business investment, which makes up a large part of India's GDP, is reported differently than in the other countries listed, owing to differences in legal frameworks and reporting regulations.
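The per-country gap described above can also be summarised numerically rather than read off the chart. A minimal sketch of the `group_by()`/`summarise()` pattern, using a toy tibble with hypothetical values (not the real UN figures) in place of `diff_GDP_data`:

```r
library(dplyr)

# Toy stand-in for diff_GDP_data (hypothetical values, not the real UN figures)
toy_diff <- tibble(
  Country = rep(c("Germany", "India", "United States"), each = 3),
  GDP_diff_proportion = c(0.030, 0.029, 0.028,
                          0.060, 0.040, 0.020,
                          0.012, 0.010, 0.011)
)

# One row per country: average and worst-case gap between the two GDP series
toy_diff %>%
  group_by(Country) %>%
  summarise(mean_gap = mean(GDP_diff_proportion),
            max_gap  = max(GDP_diff_proportion))
```

Running the same pipeline on `diff_GDP_data` would give a compact table backing up the visual impression that India's discrepancy is the largest of the three.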
Next, we are reproducing the above plot using the following code:
#Component as a proportion of original GDP (replicating the graph)
plot_GDP_percent_diff <- original_GDP_data %>%
pivot_longer(cols = 11:14, #columns 11 to 14
names_to = "Component",
values_to = "Proportion")
ggplot(plot_GDP_percent_diff, aes(x = Year,
y = Proportion,
group = Component,
colour = Component))+
scale_x_discrete(breaks = seq(1970, 2020, by = 10)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
geom_line(size = 0.9) +
facet_wrap(vars(Country), scales = "free_x") +
theme_bw()+
labs (
title = "GDP components over time",
x = "",
y = "proportion") +
guides(color=guide_legend(title="")) +
NULL
#Component as a proportion of calculated GDP
calculated_GDP_data <- three_countries_data %>%
mutate(IndicatorName = factor(IndicatorName,
levels = GDP_component_list,
labels = GDP_component_cleaned_list)) %>%
pivot_wider(names_from = IndicatorName,
values_from = Country_GDP) %>%
mutate(GDP_calculated = Household_expenditure + Government_expenditure + Gross_capital_formation + Exports - Imports,
Household_expenditure_proportion = Household_expenditure / GDP_calculated,
Government_expenditure_proportion = Government_expenditure / GDP_calculated,
Gross_capital_formation_proportion = Gross_capital_formation / GDP_calculated,
Net_Exports_proportion = (Exports - Imports) / GDP_calculated)
plot_calculated_GDP <- calculated_GDP_data %>%
pivot_longer(cols = 11:14, #columns 11 to 14
names_to = "Component",
values_to = "Proportion")
ggplot(plot_calculated_GDP, aes(x = Year,
y = Proportion,
group = Component,
colour = Component))+
scale_x_discrete(breaks = seq(1970, 2020, by = 10)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
geom_line(size = 0.9) +
facet_wrap(vars(Country), scales = "free_x") +
theme_bw()+
labs (
title = "GDP and its breakdown at constant 2010 prices in US Dollars",
x = "",
y = "proportion" ) +
guides(color=guide_legend(title="")) +
NULL
> What is this last chart telling you? Can you explain in a couple of paragraphs the different dynamic among these three countries?
India shows a large increase in the investment share, consistent with post-1991 economic growth, particularly in the tech sector, where foreign investment has grown massively. Its household expenditure proportion is falling, a common pattern of development as households gain more disposable income, while its government expenditure share remains comparatively low.
The US shows net exports declining, partly due to cheaper manufacturing elsewhere, marking its transition to a service economy, along with a drop in government expenditure, perhaps reflecting a turn towards austerity after recessions. Gross capital formation, however, remains stable.
Germany shows a rise in net exports, unusual for a Western economy, and is stable on the other fronts.
As a true test to yourself, do you understand the code you submitted and are you able to explain it to someone else?
Yes!